Inworld TTS

Overview
Inworld TTS is a real-time text-to-speech system built by Inworld AI. It is a paid, mid-tier service: somewhat higher quality than XTTS and much cheaper than ElevenLabs. The service is credit-based with no subscription. As of March 2026, new accounts receive $2 in free credits. Note that Inworld takes a while to clone voices, about 10-15 seconds per voice. The first time you speak to an NPC with a new voice type, the response will be delayed; subsequent generations should be fast.
Within SkyrimNet-style setups, it represents a fully managed, cloud-based alternative to local solutions like XTTS or Piper.
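The first-use delay described above behaves like a cache miss. The following is a minimal sketch of that idea, not SkyrimNet's or Inworld's actual implementation: `VoiceCloneCache`, `fake_clone`, and the `MaleNord` voice type are hypothetical names used only for illustration, and the cloning delay is shortened so the example runs quickly.

```python
import time
from typing import Callable, Dict


class VoiceCloneCache:
    """Remembers cloned voice IDs so each voice type is cloned only once."""

    def __init__(self, clone_fn: Callable[[str], str]) -> None:
        self._clone_fn = clone_fn  # hypothetical slow cloning call
        self._voice_ids: Dict[str, str] = {}

    def get_voice_id(self, voice_type: str) -> str:
        # The first request for a voice type pays the cloning cost...
        if voice_type not in self._voice_ids:
            self._voice_ids[voice_type] = self._clone_fn(voice_type)
        # ...later requests return the stored ID immediately.
        return self._voice_ids[voice_type]


def fake_clone(voice_type: str) -> str:
    """Stand-in for the real cloning step (10-15 s in practice, shortened here)."""
    time.sleep(0.05)
    return f"cloned-{voice_type}"


if __name__ == "__main__":
    cache = VoiceCloneCache(fake_clone)
    for attempt in range(2):
        start = time.perf_counter()
        voice_id = cache.get_voice_id("MaleNord")
        elapsed = time.perf_counter() - start
        print(f"attempt {attempt}: {voice_id} in {elapsed:.3f}s")
```

Running the script shows the first lookup taking roughly the simulated cloning time and the second returning almost instantly, which mirrors the delay you hear only on the first line from a new voice type.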
Key Features
1. Real-Time Streaming
- Designed for low latency
- Supports streaming audio output
- Characters can begin speaking before text generation finishes
2. Character-Native Design
- Built to work with Inworld’s AI character system
- Speech is generated as part of a unified pipeline:
- Dialogue → Emotion → Voice output
3. Fully Managed Cloud Service
- No local model setup required
- Hosted inference via API
- Handles:
- Scaling
- Optimization
- Updates
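To illustrate why streaming lets a character begin speaking before text generation finishes, here is a toy simulation using Python generators. It makes no network calls and does not use the Inworld API; `generate_text`, `synthesize_stream`, and `play` are hypothetical stand-ins, and the "audio" is just placeholder bytes.

```python
from typing import Iterator, List


def generate_text(events: List[str]) -> Iterator[str]:
    """Stand-in for a dialogue LLM that emits a reply one sentence at a time."""
    for sentence in ["Well met.", "The roads are dangerous.", "Stay safe."]:
        events.append(f"text ready: {sentence}")
        yield sentence


def synthesize_stream(sentences: Iterator[str], events: List[str]) -> Iterator[bytes]:
    """Stand-in for a streaming TTS call: yields one fake audio chunk per sentence."""
    for sentence in sentences:
        events.append(f"audio ready: {sentence}")
        yield sentence.encode("utf-8")  # placeholder bytes, not real audio


def play(chunks: Iterator[bytes], events: List[str]) -> None:
    """Stand-in for playback: handles each chunk as soon as it arrives."""
    for chunk in chunks:
        events.append(f"playing: {chunk.decode('utf-8')}")


def run_demo() -> List[str]:
    """Chain the three stages lazily and return the order in which things happened."""
    events: List[str] = []
    play(synthesize_stream(generate_text(events), events), events)
    return events


if __name__ == "__main__":
    for event in run_demo():
        print(event)
```

Because each stage pulls from the previous one lazily, the event log interleaves: the first sentence is already "playing" before the last sentence has been generated. A non-streaming pipeline would instead wait for the full reply before producing any audio.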
Model Variants
Inworld TTS 1
- First-generation system
- Focus on:
- Low latency
- Stable real-time performance
- Pros:
- Fast and reliable
- Good conversational quality
- Cons:
- Less expressive than newer models
- More limited emotional range
Inworld TTS 1.5
- Improved version with better prosody and realism
- Enhancements:
- More natural pacing
- Better emotional transitions
- Improved voice consistency
Inworld TTS 1.5 Max
- Highest-tier offering
- Focus on maximum expressiveness and realism
- Improvements over 1.5:
- Richer emotional depth
- More nuanced delivery (pauses, emphasis, tone shifts)
- Better handling of:
- Long-form dialogue
- Complex conversational context
- Trade-offs:
- Slightly higher latency than base models
- Higher cost (API usage)
Integration Characteristics
Typical Workflow
- Send dialogue text (often with context/metadata)
- Inworld processes:
- Intent
- Emotion
- Character state
- TTS generates streamed audio output
Compared to SkyrimNet Local TTS
- No need for:
- Voice sample management
- Model hosting
- GPU setup and VRAM usage
Strengths
- ✔️ Ultra-low latency streaming
- ✔️ Strong emotion and personality modeling
- ✔️ No setup or hosting required
- ✔️ Consistent voice quality out of the box
- ✔️ Cloning is automatic and can preserve voice effects such as echoes and reverb
Limitations
- ❗ Requires cloud connectivity
- ❗ Ongoing API cost, though it is very affordable
Quick Setup
- Sign up for an account on the Inworld TTS website.
- Click the API Keys link in the bottom left of the site and click Generate new key. Create it with Write permission.
- In SkyrimNet's Test & Easy Setup page, set the TTS Backend dropdown to Inworld and hit save.
- In SkyrimNet's Advanced Configuration page, go to NPC Voices -> Inworld TTS -> Connection and enter both your Workspace ID and your Basic (Base64) key from the API Keys page on Inworld's website. Save the changes.
- Also in the Inworld TTS configuration page, you can change the TTS -> Model ID setting to inworld-tts-1-max for higher quality (at 2x the cost). Voice -> Enable Audio Tags can also add more emotional quality.
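SkyrimNet handles authentication for you once the key is saved, but it can help to know what a "Basic (Base64)" key is. The sketch below assumes the key is used as a standard HTTP Basic credential, where the header value is the word `Basic` followed by the Base64 string. The function name `basic_auth_header` and the demo credential are hypothetical, and the check only confirms the pasted value is well-formed Base64, not that Inworld will accept it.

```python
import base64
import binascii
from typing import Dict


def basic_auth_header(base64_key: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header from an already-encoded key."""
    key = base64_key.strip()  # copy/paste often adds stray whitespace
    if not key:
        raise ValueError("key is empty")
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        base64.b64decode(key, validate=True)
    except binascii.Error as exc:
        raise ValueError("key is not valid Base64; re-copy it from the API Keys page") from exc
    return {"Authorization": f"Basic {key}"}


if __name__ == "__main__":
    # Placeholder credential, NOT a real key: Base64 of "example-id:example-secret".
    demo_key = base64.b64encode(b"example-id:example-secret").decode("ascii")
    print(basic_auth_header(demo_key))
```

If a pasted key fails this kind of check, the most likely cause is a truncated or partially copied value, so copying it again from the API Keys page is the first thing to try.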
Comparison (SkyrimNet Context)
| Feature | Inworld TTS | XTTS | Piper | Zonos |
|---|---|---|---|---|
| Speed | Very fast (streaming) | Medium | Very fast | Slow |
| Quality | High (conversational) | Good | Lower | High |
| Emotion | Native / automatic | Limited | Minimal | High (manual) |
| Voice Cloning | Yes, with effects | Yes | No | Yes |
| Setup | None (cloud) | Moderate | Easy | Complex |
| Offline Support | No | Yes | Yes | Yes |
Notes
- Best results are achieved when used with Inworld’s full character system
- Voice output is influenced by AI state, not just raw text input
Audio Tags
Overview
Audio Tags in Inworld TTS are inline annotations (e.g., [whisper], [laugh]) that modify how a line is spoken, not what is said.
They allow you to inject paralinguistic cues directly into dialogue, influencing delivery such as tone, emotion, and non-verbal sounds.
How They Work
- Tags are written inside square brackets. If Enable Audio Tags is turned on, the dialogue LLM generates them as part of the line, and they are sent along with the text to Inworld TTS.
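As a rough illustration of the "how it is spoken, not what is said" distinction, the sketch below separates bracketed tags from the spoken text of a line. It is not how Inworld parses tags internally: the `split_audio_tags` helper and its regular expression are assumptions made for this example, and only the [whisper] and [laugh] tags come from the text above.

```python
import re
from typing import List, Tuple

# Assumed tag syntax: lowercase words (optionally with spaces or underscores)
# inside square brackets, e.g. [whisper] or [laugh].
TAG_PATTERN = re.compile(r"\[([a-z_ ]+)\]")


def split_audio_tags(line: str) -> Tuple[List[str], str]:
    """Return (tags, spoken_text) for a line with inline [tag] annotations."""
    tags = TAG_PATTERN.findall(line)
    spoken = TAG_PATTERN.sub("", line)
    # Collapse the whitespace left behind where tags were removed.
    spoken = re.sub(r"\s+", " ", spoken).strip()
    return tags, spoken


if __name__ == "__main__":
    tags, spoken = split_audio_tags(
        "[whisper] Keep your voice down. [laugh] They never look up."
    )
    print(tags)    # delivery cues
    print(spoken)  # the words actually said
```

The tags carry delivery cues only; removing them leaves the dialogue text unchanged, which is why a line still reads correctly in subtitles even when tags are enabled.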

Bottom Line
Inworld TTS is a real-time, character-aware speech system that prioritizes:
- Emotion
- Responsiveness
- Conversational realism
It is ideal if you want:
- Plug-and-play setup
- Emotionally expressive NPCs
- Streaming dialogue with minimal latency
- No local resource cost, since it's an external service